Cross-Document Pattern Matching
Abstract
We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem.
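To make the query format concrete, here is a minimal sketch of the problem's interface in Python. It is a naive baseline, not the linear-space index the abstract describes: the query time here depends on both the pattern and the target document lengths. The class name, method names, and the (k, i, j, target) query convention are assumptions made for illustration only.

```python
class NaiveCrossDocumentIndex:
    """Naive baseline for cross-document matching (illustrative only).

    A query names its pattern by position, as documents[k][i:j], and asks
    where that substring occurs in documents[target]. This sketch simply
    rescans the target document on every query.
    """

    def __init__(self, documents):
        # Hypothetical storage: the documents are kept as plain strings.
        self.documents = list(documents)

    def pattern(self, k, i, j):
        """Return the pattern, i.e. the substring documents[k][i:j]."""
        return self.documents[k][i:j]

    def occurrences(self, k, i, j, target):
        """Start positions of documents[k][i:j] inside documents[target]."""
        p = self.pattern(k, i, j)
        if not p:
            return []  # an empty pattern is treated as having no occurrences
        text = self.documents[target]
        hits = []
        pos = text.find(p)
        while pos != -1:
            hits.append(pos)
            pos = text.find(p, pos + 1)  # step by one to allow overlaps
        return hits


index = NaiveCrossDocumentIndex(["abracadabra", "cadabra cab", "xyz"])
print(index.occurrences(0, 0, 4, 0))  # "abra" in document 0 -> [0, 7]
print(index.occurrences(0, 4, 7, 1))  # "cad" in document 1 -> [0]
```

The point of the sketch is the shape of the query: the pattern is never passed as a string, only as a position range in one of the indexed documents, which is what lets an index answer without reading the pattern itself.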
Related articles
Entropy-based pattern matching for document image compression
In this paper, we introduce a pattern matching algorithm used in document image compression. This pattern matching algorithm uses the cross entropy between two patterns as the criterion for a match. We use a physical model which is based on the finite resolution of the scanner (spatial sampling error) to estimate the probability values used in cross entropy calculation. Experimental results show ...
Bitmap reconstruction for document image compression
We introduce a pattern matching algorithm and a bitmap reconstruction method used in document image compression. This pattern matching algorithm uses the cross entropy between two patterns as the criterion for a match. We use a physical model which is based on the finite resolution of the scanner (spatial sampling error) to estimate the probability values used in cross entropy calculation. The ma...
Entity Profile Extraction from Large Corpora
Information Extraction (IE) has two anchor points: (i) entity-centric information leads to an Entity Profile (EP); (ii) action-centric information leads to an Event Scenario. Based on a pipelined architecture which involves both document-level IE and corpus-level IE, a multi-level modular approach to EP extraction from large corpora is described: (i) named entity tagging; (ii) three-level patte...
Reputation Extraction Using Both Structural and Content Information
We propose a new method of extracting texts related to a given keyword from Web pages collected by a search engine. By combining structural pattern matching and text classification, texts related to a given keyword such as reputations of a given restaurant can be extracted automatically from Web pages in unfixed sites, which is impossible by conventional wrappers. According to our cross validat...
A Codebook Generation Algorithm for Document Image Compression
Pattern-matching based document compression systems rely on finding a small set of patterns that can be used to represent all of the ink in the document. Finding an optimal set of patterns is NP-hard; previous compression schemes have resorted to heuristics. We extend the cross-entropy approach, used previously for measuring pattern similarity, to this problem. Using this approach we reduce the...
Journal: J. Discrete Algorithms
Volume: 24
Publication date: 2012